In this hands-on session, we will show how principal component analysis (PCA) is used in common bioinformatics workflows. We will use a toy single-cell RNA-seq (scRNA-seq) dataset and a real NCI microarray dataset to illustrate it.
The main objectives are the following: to understand how PCA reduces dimensionality, and to compute it both from the eigendecomposition of the covariance matrix and with the prcomp function.
Analyzing tabular datasets, especially high-dimensional data, involves identifying which features carry the most information and whether some features can be recovered from others. Principal component analysis (PCA) is a dimensionality reduction method that transforms high-dimensional data into a lower-dimensional representation while retaining as much information as possible.
We will work with tabular data with numerical entries, using an \(n \times m\) matrix to represent feature values (columns) for experimental samples (rows).
PCA offers several advantages when analyzing biological data:
Single-cell RNA-seq allows the gene expression profiling of thousands of cells, one at a time. It is often exemplified using the following analogy:
scRNA-seq is a double-edged sword. On the one hand, scRNA-seq provides unprecedented discriminative power to chart all the cell types and states in a given tissue. This has catalyzed the creation of the Human Cell Atlas (HCA), which aims to classify all cells in the human body using scRNA-seq and has been described as "the periodic table of human cells".
On the other hand, scRNA-seq has a high degree of technical variability, which can be introduced at multiple points of the process. Some examples include the following (definitions extracted from Table 1 of this paper):
As an example, we will create an expression matrix with the following cells:
In the following image you can visualize the redundancy between CD79A and CD79B. Together they form the heterodimer CD79, which means that for every molecule of CD79A we need one molecule of CD79B. Under the hood they behave as a single variable, which should be captured by a single principal component.
Image obtained from this link
Load packages
library(pheatmap)
library(tidyverse)
We will create a toy dataset that contains the gene expression of each of the cells. Cells are represented as rows and genes as columns.
# Create toy dataset
toy <- data.frame(
CD3D = c(4, 5, 4, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
CD3E = c(5, 3, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
LYZ = c(0, 0, 0, 0, 5, 7, 6, 0, 0, 0, 0, 0, 0, 0),
S100A8 = c(0, 0, 0, 0, 5, 7, 6, 0, 0, 0, 0, 0, 0, 0),
CD79A = c(0, 0, 0, 0, 0, 0, 0, 6, 4, 4, 3, 5, 0, 0),
CD79B = c(0, 0, 0, 0, 0, 0, 0, 5, 5, 3, 4, 6, 0, 0),
BLNK = c(0, 0, 0, 0, 0, 0, 0, 6, 4, 5, 5, 4, 0, 0),
TOP2A = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 9, 0, 0),
MKI67 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11, 10, 0, 0),
MT_CO1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 18, 23),
MT_ND5 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 22, 25)
)
rownames(toy) <- paste("cell", as.character(1:nrow(toy)))
toy <- as.matrix(toy)
# Visualize
pheatmap(toy, cluster_cols = FALSE, cluster_rows = FALSE, angle_col = 45)
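Before running PCA, we can already quantify the redundancy described above. A minimal self-contained sketch, re-using the CD79A/CD79B columns of the toy matrix as plain vectors:

```r
# CD79A and CD79B form the heterodimer CD79 (one molecule of each per
# complex), so their expression is strongly correlated across cells
cd79a <- c(0, 0, 0, 0, 0, 0, 0, 6, 4, 4, 3, 5, 0, 0)
cd79b <- c(0, 0, 0, 0, 0, 0, 0, 5, 5, 3, 4, 6, 0, 0)
cor(cd79a, cd79b)
```

Equivalently, `cor(toy[, "CD79A"], toy[, "CD79B"])` on the matrix above gives the same value.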
There are two ways to perform PCA in R:

1. Manually, computing the eigendecomposition of the covariance matrix with the eigen function.
2. Using the stand-alone prcomp function.

Regardless of the method, we always need to normalize the data (center and scale). We will do it by using the function scale:
toy_normalized <- scale(toy, center = TRUE, scale = TRUE)
toy_normalized[1:5,1:5]
## CD3D CD3E LYZ S100A8 CD79A
## cell 1 1.4913495 2.0133218 -0.4974087 -0.4974087 -0.6899925
## cell 2 2.0133218 0.9693771 -0.4974087 -0.4974087 -0.6899925
## cell 3 1.4913495 1.4913495 -0.4974087 -0.4974087 -0.6899925
## cell 4 0.9693771 1.4913495 -0.4974087 -0.4974087 -0.6899925
## cell 5 -0.5965398 -0.5965398 1.4369585 1.4369585 -0.6899925
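scale returns a matrix in which every column (gene) has mean zero and standard deviation one. A minimal self-contained check on synthetic data:

```r
# After centering and scaling, every column has mean 0 and sd 1
set.seed(1)
X <- matrix(rnorm(30), nrow = 6)
Xn <- scale(X, center = TRUE, scale = TRUE)
colMeans(Xn)      # numerically zero
apply(Xn, 2, sd)  # exactly one
```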
Now we will compute the exact same thing (PCA) in both ways.
In each case we will track the same three pieces of information: the loadings (principal components), the variance along each component, and the scores of each sample.
We will compute an eigenbasis for \(\Omega = A^{\top}A\), where \(A\) is the normalized data (centered and scaled). Note that \(\Omega\) equals the sample covariance matrix up to the constant factor \(n-1\). We will use the function eigen.
The eigenvectors of the covariance matrix are the loadings (principal components):
cov_matrix <- (t(toy_normalized) %*% toy_normalized)
eigen_cov_matrix <- eigen(cov_matrix)
pdir_eigen <- eigen_cov_matrix$vectors
pdir_eigen[1:5,1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.2229758 -0.50005276 -0.17225849 -0.08980263 0.1673922
## [2,] -0.2229758 -0.50005276 -0.17225849 -0.08980263 0.1673922
## [3,] -0.1704618 0.48788427 -0.31105536 -0.08464816 0.1492861
## [4,] -0.1704618 0.48788427 -0.31105536 -0.08464816 0.1492861
## [5,] 0.4308542 -0.01396106 -0.01460081 0.32621575 0.3088705
The eigenvalues contain the variance of the data along each principal component, multiplied by the constant factor \(n-1\).
var_eigen <- eigen_cov_matrix$values
head(var_eigen)
## [1] 59.6132013 33.9501945 31.0238660 15.9177700 1.0317758 0.8173653
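These eigenvalues are \(n-1\) times the per-component variances that prcomp reports, because the sample covariance matrix is \(A^{\top}A/(n-1)\). A self-contained check on synthetic data:

```r
# The sample covariance is t(A) %*% A / (n - 1), so the eigenvalues of
# t(A) %*% A are (n - 1) times the variances reported by prcomp
set.seed(1)
A <- scale(matrix(rnorm(40), nrow = 8))
eig_vals <- eigen(t(A) %*% A)$values
pca_vars <- prcomp(A)$sdev ^ 2
all.equal(eig_vals, pca_vars * (nrow(A) - 1))
```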
The scores of each data sample on the principal components can be obtained by multiplying the normalized matrix with the matrix that has the eigenvectors as columns:
scores_eigen <- toy_normalized %*% pdir_eigen
scores_eigen[1:5, 1:5]
## [,1] [,2] [,3] [,4] [,5]
## cell 1 -1.700709 -2.277482 -0.7241901 -0.2151054 0.10259334
## cell 2 -1.584322 -2.016468 -0.6342760 -0.1682310 0.01521924
## cell 3 -1.584322 -2.016468 -0.6342760 -0.1682310 0.01521924
## cell 4 -1.467934 -1.755454 -0.5443618 -0.1213565 -0.07215486
## cell 5 -1.312695 1.959136 -1.1183532 -0.1207163 -0.10622519
Now we will use the stand-alone PCA function provided by R, prcomp, to calculate the principal components.
With center = TRUE and scale = TRUE we standardize the variables (mean zero, standard deviation one) directly within prcomp.
pca_out <- prcomp(toy, center = TRUE, scale = TRUE)
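Scaling inside prcomp is equivalent to scaling the matrix beforehand with scale, as a quick sketch on synthetic data shows:

```r
# The center/scale arguments of prcomp reproduce a manual call to scale()
set.seed(1)
M <- matrix(rnorm(50), nrow = 10)
p1 <- prcomp(M, center = TRUE, scale = TRUE)
p2 <- prcomp(scale(M, center = TRUE, scale = TRUE))
all.equal(p1$x, p2$x)
```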
We can use the following slots to parse the output object:
Principal components (loadings): "rotation"
pdir_pca <- pca_out$rotation
pdir_pca[1:5,1:5]
## PC1 PC2 PC3 PC4 PC5
## CD3D -0.2229758 -0.50005276 0.17225849 0.08980263 0.1673922
## CD3E -0.2229758 -0.50005276 0.17225849 0.08980263 0.1673922
## LYZ -0.1704618 0.48788427 0.31105536 0.08464816 0.1492861
## S100A8 -0.1704618 0.48788427 0.31105536 0.08464816 0.1492861
## CD79A 0.4308542 -0.01396106 0.01460081 -0.32621575 0.3088705
Standard deviations: "sdev"
var_pca <- pca_out$sdev ** 2
head(var_pca)
## [1] 4.58563087 2.61155343 2.38645123 1.22444385 0.07936737 0.06287425
Principal component scores: "x"
scores_pca <- pca_out$x
scores_pca[1:5,1:5]
## PC1 PC2 PC3 PC4 PC5
## cell 1 -1.700709 -2.277482 0.7241901 0.2151054 0.10259334
## cell 2 -1.584322 -2.016468 0.6342760 0.1682310 0.01521924
## cell 3 -1.584322 -2.016468 0.6342760 0.1682310 0.01521924
## cell 4 -1.467934 -1.755454 0.5443618 0.1213565 -0.07215486
## cell 5 -1.312695 1.959136 1.1183532 0.1207163 -0.10622519
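Comparing the two approaches, some columns (here PC3 and PC4) differ only in sign. This is expected: eigenvectors are defined up to sign, so both solutions are equally valid. A self-contained sketch of the comparison:

```r
# Principal components are defined only up to sign, so compare the
# absolute values of the scores from the two approaches
set.seed(1)
X <- scale(matrix(rnorm(60), nrow = 10))
scores_eig <- X %*% eigen(t(X) %*% X)$vectors
scores_svd <- prcomp(X)$x
all.equal(abs(scores_eig), abs(scores_svd), check.attributes = FALSE)
```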
In summary, we obtain the following:
Loadings: pca_out$rotation
Variances: pca_out$sdev ** 2
Scores: pca_out$x

When using the prcomp function, we can get the proportion of variance explained (PVE) and the cumulative proportion with the summary function (more info in this book):
summary(pca_out)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.1414 1.6160 1.5448 1.1065 0.28172 0.25075 0.19522
## Proportion of Variance 0.4169 0.2374 0.2170 0.1113 0.00722 0.00572 0.00346
## Cumulative Proportion 0.4169 0.6543 0.8712 0.9826 0.98977 0.99548 0.99895
## PC8 PC9 PC10 PC11
## Standard deviation 0.09752 0.04135 0.01862 1.383e-16
## Proportion of Variance 0.00086 0.00016 0.00003 0.000e+00
## Cumulative Proportion 0.99981 0.99997 1.00000 1.000e+00
To reduce the dimensionality, we will choose the number of PCs that together explain most of the variance in the dataset. To this end, we use a so-called elbow plot:
# Calculate percentage of variance explained (PVE):
pve <- pca_out$sdev ** 2 / sum(pca_out$sdev ** 2) * 100
pve <- round(pve, 2)
pve_df <- data.frame(principal_component = 1:length(pve), pve = pve)
pve_gg <- pve_df %>%
ggplot(aes(principal_component, pve)) +
geom_point() +
geom_vline(xintercept = 4.5, linetype = "dashed", color = "darkblue") +
scale_x_continuous(breaks = 1:length(pve)) +
labs(x = "Principal Component", y = "Percentage of Variance Explained (%)") +
theme_bw()
pve_gg
Some people prefer to plot the cumulative proportion of explained variance:
summ_pca <- summary(pca_out)
cum_pct <- summ_pca$importance["Cumulative Proportion", ] * 100
cum_pct_df <- data.frame(principal_component = 1:length(pve), cum_pct = cum_pct)
cum_pct_gg <- cum_pct_df %>%
ggplot(aes(principal_component, cum_pct)) +
geom_point() +
geom_vline(xintercept = 4.5, linetype = "dashed", color = "darkblue") +
scale_x_continuous(breaks = 1:length(cum_pct)) +
scale_y_continuous(limits = c(0, 100)) +
labs(x = "Principal Component", y = "Cumulative Percentage of Variance Explained (%)") +
theme_bw()
cum_pct_gg
Thus, we conclude that the first 4 PCs capture almost all the variance (over 98% cumulatively). Let us express each cell as a vector of its scores on the first 4 PCs:
toy_reduced <- pca_out$x[, c("PC1", "PC2", "PC3", "PC4")]
pheatmap(toy_reduced, cluster_rows = FALSE, cluster_cols = FALSE, angle_col = 45)
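Keeping only the top PCs is, in the least-squares sense, the best low-rank approximation of the normalized matrix: multiplying the retained scores by the transposed loadings reconstructs the data. A self-contained sketch on synthetic data (the helper name `reconstruct` is our own, not part of prcomp):

```r
# Reconstruct normalized data from the top k PCs: X_hat = scores %*% t(loadings)
reconstruct <- function(pca, k) {
  pca$x[, 1:k, drop = FALSE] %*% t(pca$rotation[, 1:k, drop = FALSE])
}

set.seed(1)
X <- scale(matrix(rnorm(140), nrow = 14))
pca <- prcomp(X)
# with all components kept, the reconstruction error is numerically zero,
# and it shrinks monotonically as more PCs are retained
max(abs(reconstruct(pca, ncol(X)) - X))
```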
Now we have reduced the dimensionality of the dataset. To interpret each principal component, let us inspect the gene loadings:
gene_loads_reduced <- pca_out$rotation[, c("PC1", "PC2", "PC3", "PC4")]
significant_pcs <- c("PC1", "PC2", "PC3", "PC4")
loadings_gg <- map(significant_pcs, function(x) {
loadings <- gene_loads_reduced[, x]
df <- data.frame(gene = names(loadings), score = loadings)
p <- df %>%
ggplot(aes(fct_reorder(gene, score), score)) +
geom_point() +
labs(x = "", y = x) +
theme_bw() +
coord_flip()
return(p)
})
loadings_gg
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
Clustering means classifying observations into groups so as to minimize intra-group distances and maximize inter-group distances.
To that end we need a concept of distance, which basically measures how similar or different two vectors are. In our case we will use Euclidean distance, but be aware that there are a plethora of different options that can yield different clustering results.
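For two expression vectors \(x\) and \(y\), the Euclidean distance is \(\sqrt{\sum_i (x_i - y_i)^2}\), which is exactly what dist computes. A minimal sketch:

```r
# Euclidean distance by hand versus with dist()
x <- c(4, 5, 0)
y <- c(0, 0, 6)
by_hand  <- sqrt(sum((x - y)^2))           # sqrt(77)
built_in <- as.numeric(dist(rbind(x, y)))  # dist() defaults to Euclidean
c(by_hand = by_hand, built_in = built_in)
```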
Let us start by computing all pairwise distances (distance between cells, considering their expression profiles):
dist_mat <- dist(toy_reduced, method = "euclidean")
pheatmap(dist_mat, cluster_rows = FALSE, cluster_cols = FALSE)
hclust_average <- hclust(dist_mat, method = "average")
plot(
hclust_average,
labels = rownames(toy),
main = "Average Linkage",
xlab = "",
sub = "",
ylab = "Cophenetic distance"
)
plot(
hclust_average,
labels = rownames(toy),
main = "Average Linkage",
xlab = "",
sub = "",
ylab = "Cophenetic distance"
)
abline(h = 2.5, col = "red")
Visualize the cluster assigned to each cell:
hc_clusters <- cutree(hclust_average, h = 2.5)
hc_clusters
## cell 1 cell 2 cell 3 cell 4 cell 5 cell 6 cell 7 cell 8 cell 9 cell 10
## 1 1 1 1 2 2 2 3 3 3
## cell 11 cell 12 cell 13 cell 14
## 4 4 5 5
Visualize the number of cells in each cluster:
table(hc_clusters)
## hc_clusters
## 1 2 3 4 5
## 4 3 3 2 2
| Cluster | Cell type |
|---|---|
| 1 | T-cells |
| 2 | Monocytes |
| 3 | Naive B-cells |
| 4 | Plasma Cells |
| 5 | poor-quality cells |
Visualize the centered gene expression across cells and their corresponding cluster:
annotation_rows <- data.frame(cluster = as.character(hc_clusters))
rownames(annotation_rows) <- names(hc_clusters)
annotated_clusters <- c("T-cells", "Monocytes", "Naive B-cells", "Plasma Cells", "poor-quality cells")
names(annotated_clusters) <- as.character(1:5)
annotation_rows$annotation <- annotated_clusters[annotation_rows$cluster]
pheatmap(
toy_normalized,
cluster_rows = FALSE,
cluster_cols = FALSE,
angle_col = 315,
annotation_row = annotation_rows
)
We will use the NCI60 data from the ISLR package. This dataset contains NCI microarray data: expression levels of 6830 genes measured in 64 cancer cell lines.
Exercise
Find the principal components of this dataset and cluster the cell lines according to their similarities in gene expression.
We will load the data and keep the information of the table and the tumor types of the cell lines:
library(ISLR)
data_NCI60 <- NCI60$data
labels_NCI60 <- NCI60$labs
rownames(data_NCI60) <- labels_NCI60
We can look at the dimensions of the dataset, which contains information about 64 cell lines of different tumor types and 6830 genes.
dim(data_NCI60)
## [1] 64 6830
We can see the list of the tumor types:
labels_NCI60
## [1] "CNS" "CNS" "CNS" "RENAL" "BREAST"
## [6] "CNS" "CNS" "BREAST" "NSCLC" "NSCLC"
## [11] "RENAL" "RENAL" "RENAL" "RENAL" "RENAL"
## [16] "RENAL" "RENAL" "BREAST" "NSCLC" "RENAL"
## [21] "UNKNOWN" "OVARIAN" "MELANOMA" "PROSTATE" "OVARIAN"
## [26] "OVARIAN" "OVARIAN" "OVARIAN" "OVARIAN" "PROSTATE"
## [31] "NSCLC" "NSCLC" "NSCLC" "LEUKEMIA" "K562B-repro"
## [36] "K562A-repro" "LEUKEMIA" "LEUKEMIA" "LEUKEMIA" "LEUKEMIA"
## [41] "LEUKEMIA" "COLON" "COLON" "COLON" "COLON"
## [46] "COLON" "COLON" "COLON" "MCF7A-repro" "BREAST"
## [51] "MCF7D-repro" "BREAST" "NSCLC" "NSCLC" "NSCLC"
## [56] "MELANOMA" "BREAST" "BREAST" "MELANOMA" "MELANOMA"
## [61] "MELANOMA" "MELANOMA" "MELANOMA" "MELANOMA"
Solution
We will use the function prcomp to calculate the principal components.
pr_output <- prcomp(data_NCI60, center = TRUE, scale = TRUE)
As in the previous example, we can calculate the loadings:
pdir_pca <- pr_output$rotation
pdir_pca[1:6,1:6]
## PC1 PC2 PC3 PC4 PC5 PC6
## 1 -0.010682370 0.001324406 0.008503514 -0.003524094 -0.010126893 0.028903572
## 2 -0.002312078 0.001675266 0.010256593 0.002603645 -0.011400802 0.011242885
## 3 -0.005879750 -0.006289434 0.010055415 -0.010681458 0.010264980 0.018449224
## 4 0.003278071 0.002666138 0.008361513 -0.007475761 0.011248268 0.005553433
## 5 -0.007677535 -0.002508097 0.013820836 0.009509144 0.004094756 -0.002731900
## 6 0.002266671 -0.009677933 0.010818283 -0.012751147 -0.007196820 0.010587841
The variances (squared standard deviations):
var_pca <- pr_output$sdev ** 2
head(var_pca)
## [1] 775.8157 461.4486 392.8508 290.1080 255.0986 247.1524
And the principal component scores:
scores_pca <- pr_output$x
scores_pca[1:6,1:6]
## PC1 PC2 PC3 PC4 PC5 PC6
## CNS -19.68245 3.527748 -9.7354382 0.8177816 -12.511081 7.4129037
## CNS -22.90812 6.390938 -13.3725378 -5.5911088 -7.972471 3.6860385
## CNS -27.24077 2.445809 -3.5053437 1.3311502 -12.466296 17.2088846
## RENAL -42.48098 -9.691742 -0.8830921 -3.4180227 -41.938370 27.0251739
## BREAST -54.98387 -5.158121 -20.9291076 -15.7253986 -10.361364 12.8891587
## CNS -26.96488 6.727122 -21.6422924 -13.7323153 7.934827 0.7073495
We will plot the scores on the first two principal components as a scatter plot to visualize the data. We will color the observations (cell lines) according to their cancer type; in this way we can see to what extent the observations within a cancer type are similar to each other.
scores <- as.data.frame.array(pr_output$x)
scores$tumor_type <- labels_NCI60
scores %>%
ggplot(aes(x = PC1, y = PC2, color = tumor_type)) +
geom_point() +
labs(x = "PC1", y = "PC2", color = "Tumor type") +
theme_classic()
summary(pr_output)$importance[,c(1:14)]
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 27.85347 21.48136 19.82046 17.03256 15.97181 15.72108
## Proportion of Variance 0.11359 0.06756 0.05752 0.04248 0.03735 0.03619
## Cumulative Proportion 0.11359 0.18115 0.23867 0.28115 0.31850 0.35468
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 14.47145 13.54427 13.14400 12.73860 12.68672 12.15769
## Proportion of Variance 0.03066 0.02686 0.02529 0.02376 0.02357 0.02164
## Cumulative Proportion 0.38534 0.41220 0.43750 0.46126 0.48482 0.50646
## PC13 PC14
## Standard deviation 11.83019 11.62554
## Proportion of Variance 0.02049 0.01979
## Cumulative Proportion 0.52695 0.54674
As in the previous example, we will represent the cumulative proportion of variance explained:
pve <- pr_output$sdev ** 2 / sum(pr_output$sdev ** 2) * 100
pve <- round(pve, 2)
pve_df <- data.frame(principal_component = 1:length(pve), pve = pve)
summ_pca <- summary(pr_output)
cum_pct <- summ_pca$importance["Cumulative Proportion", ] * 100
cum_pct_df <- data.frame(principal_component = 1:length(pve), cum_pct = cum_pct)
cum_pct_gg <- cum_pct_df %>%
ggplot(aes(principal_component, cum_pct)) +
geom_point() +
scale_x_continuous(breaks = 1:length(cum_pct)) +
scale_y_continuous(limits = c(0, 100)) +
labs(x = "Principal Component", y = "Cumulative Percentage of Variance Explained (%)") +
theme_bw()
cum_pct_gg
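Rather than eyeballing the curve, one can also select the smallest number of PCs whose cumulative PVE passes a chosen threshold. A sketch (both the helper name n_pcs and the 40% threshold are our arbitrary choices):

```r
# Smallest number of PCs whose cumulative variance reaches `threshold`
n_pcs <- function(pca, threshold) {
  pve <- pca$sdev^2 / sum(pca$sdev^2)
  min(which(cumsum(pve) >= threshold))
}

set.seed(1)
pca <- prcomp(scale(matrix(rnorm(200), nrow = 20)))
n_pcs(pca, 0.4)
```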
In this case, we will select the first 8 PCs. We can cluster the samples within the heatmap:
pheatmap(
scores[, c("PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8")],
cluster_rows = TRUE,
cluster_cols = FALSE,
angle_col = 45,
labels_row = labels_NCI60,
fontsize_row = 5
)
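To close the exercise in the same way as the toy example, we can hierarchically cluster the cell lines on the reduced scores and cross-tabulate the clusters against the tumor types. A sketch (the average linkage and the cut into 4 clusters are illustrative choices, not prescribed by the exercise):

```r
library(ISLR)

# PCA on the NCI60 matrix, then average-linkage clustering of the
# 64 cell lines using their scores on the first 8 PCs
pr <- prcomp(NCI60$data, center = TRUE, scale = TRUE)
hc <- hclust(dist(pr$x[, 1:8], method = "euclidean"), method = "average")
clusters <- cutree(hc, k = 4)  # k = 4 is an arbitrary illustrative cut
table(clusters, NCI60$labs)    # how clusters map onto tumor types
```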
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Madrid
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ISLR_1.4 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.0
## [5] dplyr_1.1.3 purrr_1.0.2 readr_2.1.4 tidyr_1.3.0
## [9] tibble_3.2.1 ggplot2_3.4.4 tidyverse_2.0.0 pheatmap_1.0.12
## [13] BiocStyle_2.28.1
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.7 utf8_1.2.3 generics_0.1.3
## [4] stringi_1.7.12 hms_1.1.3 digest_0.6.33
## [7] magrittr_2.0.3 evaluate_0.22 grid_4.3.1
## [10] timechange_0.2.0 RColorBrewer_1.1-3 bookdown_0.35
## [13] fastmap_1.1.1 jsonlite_1.8.7 BiocManager_1.30.22
## [16] fansi_1.0.5 scales_1.2.1 jquerylib_0.1.4
## [19] cli_3.6.1 rlang_1.1.1 munsell_0.5.0
## [22] withr_2.5.1 cachem_1.0.8 yaml_2.3.7
## [25] tools_4.3.1 tzdb_0.4.0 colorspace_2.1-0
## [28] vctrs_0.6.4 R6_2.5.1 lifecycle_1.0.3
## [31] pkgconfig_2.0.3 pillar_1.9.0 bslib_0.5.1
## [34] gtable_0.3.4 glue_1.6.2 xfun_0.40
## [37] tidyselect_1.2.0 rstudioapi_0.15.0 knitr_1.44
## [40] farver_2.1.1 htmltools_0.5.6.1 rmarkdown_2.25
## [43] labeling_0.4.3 compiler_4.3.1